H1N1 Flu Vaccines - DrivenData competition

We have joined a machine learning competition on the DrivenData website, where the goal is to predict how likely individuals are to receive their H1N1 and seasonal flu vaccines. We are predicting two probabilities:

  • The probability of receiving the H1N1 vaccine
  • The probability of receiving the seasonal flu vaccine

We have three datasets:

  • Training features
  • Training labels (the targets for the training features)
  • Test features (submission data)

Additionally, we have a submission sample, which is necessary to successfully format our submission.

The full documentation of the dataset is available on the competition's official page.

Environment Settings

Before starting the analysis, we import the necessary libraries:

In [ ]:
import pandas as pd
import plotly.express as px
import plotly
plotly.offline.init_notebook_mode()
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
sns.set(rc={'figure.figsize':(14,10)})

Exploring the data

Let's take a look at our training features:

In [ ]:
# Training Features
train_features = pd.read_csv('training_set_features.csv')
train_features.head()
Out[ ]:
respondent_id h1n1_concern h1n1_knowledge behavioral_antiviral_meds behavioral_avoidance behavioral_face_mask behavioral_wash_hands behavioral_large_gatherings behavioral_outside_home behavioral_touch_face ... income_poverty marital_status rent_or_own employment_status hhs_geo_region census_msa household_adults household_children employment_industry employment_occupation
0 0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 ... Below Poverty Not Married Own Not in Labor Force oxchjgsf Non-MSA 0.0 0.0 NaN NaN
1 1 3.0 2.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 ... Below Poverty Not Married Rent Employed bhuqouqj MSA, Not Principle City 0.0 0.0 pxcmvdjn xgwztkwe
2 2 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... <= $75,000, Above Poverty Not Married Own Employed qufhixun MSA, Not Principle City 2.0 0.0 rucpziij xtkaffoo
3 3 1.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 ... Below Poverty Not Married Rent Not in Labor Force lrircsnp MSA, Principle City 0.0 0.0 NaN NaN
4 4 2.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 ... <= $75,000, Above Poverty Married Own Employed qufhixun MSA, Not Principle City 1.0 0.0 wxleyezf emcorrxb

5 rows × 36 columns

In [ ]:
# Test Features
test_features = pd.read_csv('test_set_features.csv')
test_features.head()
Out[ ]:
respondent_id h1n1_concern h1n1_knowledge behavioral_antiviral_meds behavioral_avoidance behavioral_face_mask behavioral_wash_hands behavioral_large_gatherings behavioral_outside_home behavioral_touch_face ... income_poverty marital_status rent_or_own employment_status hhs_geo_region census_msa household_adults household_children employment_industry employment_occupation
0 26707 2.0 2.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 ... > $75,000 Not Married Rent Employed mlyzmhmf MSA, Not Principle City 1.0 0.0 atmlpfrs hfxkjkmi
1 26708 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... Below Poverty Not Married Rent Employed bhuqouqj Non-MSA 3.0 0.0 atmlpfrs xqwwgdyp
2 26709 2.0 2.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 ... > $75,000 Married Own Employed lrircsnp Non-MSA 1.0 0.0 nduyfdeo pvmttkik
3 26710 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... <= $75,000, Above Poverty Married Own Not in Labor Force lrircsnp MSA, Not Principle City 1.0 0.0 NaN NaN
4 26711 3.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 ... <= $75,000, Above Poverty Not Married Own Employed lzgpxyit Non-MSA 0.0 1.0 fcxhlnwr mxkfnird

5 rows × 36 columns

Merging Train and Test Data

Since we want to scale and encode our data, we concatenate both datasets so that the same transformations are applied to every variable:

In [ ]:
full_data = pd.concat([train_features, test_features], axis=0)
full_data.head()
Out[ ]:
respondent_id h1n1_concern h1n1_knowledge behavioral_antiviral_meds behavioral_avoidance behavioral_face_mask behavioral_wash_hands behavioral_large_gatherings behavioral_outside_home behavioral_touch_face ... income_poverty marital_status rent_or_own employment_status hhs_geo_region census_msa household_adults household_children employment_industry employment_occupation
0 0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 ... Below Poverty Not Married Own Not in Labor Force oxchjgsf Non-MSA 0.0 0.0 NaN NaN
1 1 3.0 2.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 ... Below Poverty Not Married Rent Employed bhuqouqj MSA, Not Principle City 0.0 0.0 pxcmvdjn xgwztkwe
2 2 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... <= $75,000, Above Poverty Not Married Own Employed qufhixun MSA, Not Principle City 2.0 0.0 rucpziij xtkaffoo
3 3 1.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 ... Below Poverty Not Married Rent Not in Labor Force lrircsnp MSA, Principle City 0.0 0.0 NaN NaN
4 4 2.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 ... <= $75,000, Above Poverty Married Own Employed qufhixun MSA, Not Principle City 1.0 0.0 wxleyezf emcorrxb

5 rows × 36 columns
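Note that `pd.concat` keeps each frame's original row index (which is why the combined index later runs 0 to 26707 twice). One optional way to make the train/test origin explicit is to pass `keys`; a minimal sketch on toy data (the values are hypothetical, not from the competition files):

```python
import pandas as pd

# Toy stand-ins for the real feature frames (hypothetical values)
train = pd.DataFrame({'feat': [1, 2, 3]})
test = pd.DataFrame({'feat': [4, 5]})

# keys= labels each block of rows, so we can split them back apart later
full = pd.concat([train, test], axis=0, keys=['train', 'test'])

train_back = full.loc['train']
test_back = full.loc['test']
print(len(train_back), len(test_back))  # 3 2
```

In this notebook we instead split the concatenated frame back by position, using `len(train_features)`, which works equally well as long as the row order is preserved.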

Splitting the data into H1N1 and Seasonal Vaccine

Since our task is to predict two different probabilities, we will split the data into two different DataFrames:

  • One for H1N1 vaccine
  • Another for the seasonal flu vaccine

We can identify the columns specific to each vaccine by looking at the column names:

In [ ]:
full_data.columns
Out[ ]:
Index(['respondent_id', 'h1n1_concern', 'h1n1_knowledge',
       'behavioral_antiviral_meds', 'behavioral_avoidance',
       'behavioral_face_mask', 'behavioral_wash_hands',
       'behavioral_large_gatherings', 'behavioral_outside_home',
       'behavioral_touch_face', 'doctor_recc_h1n1', 'doctor_recc_seasonal',
       'chronic_med_condition', 'child_under_6_months', 'health_worker',
       'health_insurance', 'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk',
       'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective',
       'opinion_seas_risk', 'opinion_seas_sick_from_vacc', 'age_group',
       'education', 'race', 'sex', 'income_poverty', 'marital_status',
       'rent_or_own', 'employment_status', 'hhs_geo_region', 'census_msa',
       'household_adults', 'household_children', 'employment_industry',
       'employment_occupation'],
      dtype='object')
In [ ]:
# Choosing columns based on the vaccine name
h1n1_cols = [col for col in full_data.columns if 'h1n1' in col]
seas_cols = [col for col in full_data.columns if 'seas' in col]
rest_cols = [col for col in full_data.columns if col not in (h1n1_cols+seas_cols)]
In [ ]:
h1n1_cols
Out[ ]:
['h1n1_concern',
 'h1n1_knowledge',
 'doctor_recc_h1n1',
 'opinion_h1n1_vacc_effective',
 'opinion_h1n1_risk',
 'opinion_h1n1_sick_from_vacc']
In [ ]:
seas_cols
Out[ ]:
['doctor_recc_seasonal',
 'opinion_seas_vacc_effective',
 'opinion_seas_risk',
 'opinion_seas_sick_from_vacc']
In [ ]:
rest_cols
Out[ ]:
['respondent_id',
 'behavioral_antiviral_meds',
 'behavioral_avoidance',
 'behavioral_face_mask',
 'behavioral_wash_hands',
 'behavioral_large_gatherings',
 'behavioral_outside_home',
 'behavioral_touch_face',
 'chronic_med_condition',
 'child_under_6_months',
 'health_worker',
 'health_insurance',
 'age_group',
 'education',
 'race',
 'sex',
 'income_poverty',
 'marital_status',
 'rent_or_own',
 'employment_status',
 'hhs_geo_region',
 'census_msa',
 'household_adults',
 'household_children',
 'employment_industry',
 'employment_occupation']

Now that we have isolated the columns related to each vaccine, along with the common columns, we can create two different DataFrames, one per vaccine:

In [ ]:
# Data for each vaccine
h1n1_data = full_data[h1n1_cols + rest_cols]
seas_data = full_data[seas_cols + rest_cols]

Categorical and Numerical Data

We have numerical and categorical data in both datasets. We now discard high-cardinality categorical variables, because they are hard to use for prediction. The cardinality threshold we'll use to filter the columns is 10 unique values.

In [ ]:
# Keeping low cardinality columns
h1n1_categorical_cols = [col for col in h1n1_data.columns if h1n1_data[col].nunique() < 10 and h1n1_data[col].dtype == 'object']
seas_categorical_cols = [col for col in seas_data.columns if seas_data[col].nunique() < 10 and seas_data[col].dtype == 'object']

# Numerical columns
h1n1_numerical_cols = [col for col in h1n1_data.columns if h1n1_data[col].dtype in ['float64', 'int64']]
seas_numerical_cols = [col for col in seas_data.columns if seas_data[col].dtype in ['float64', 'int64']]

# Columns to keep
h1n1_data = h1n1_data[h1n1_categorical_cols + h1n1_numerical_cols]
seas_data = seas_data[seas_categorical_cols + seas_numerical_cols]
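The filtering above can be sanity-checked by inspecting cardinalities directly with `nunique`; a minimal sketch on a toy frame (toy threshold of 3 instead of the notebook's 10):

```python
import pandas as pd

# Toy frame: one low- and one high-cardinality categorical, plus a numeric
df = pd.DataFrame({
    'low_card': ['a', 'b', 'a', 'b'],
    'high_card': ['w', 'x', 'y', 'z'],
    'numeric': [1.0, 2.0, 3.0, 4.0],
})

# Unique-value counts for the object (categorical) columns only
cardinality = df.select_dtypes(include='object').nunique()
threshold = 3  # toy threshold; the notebook uses 10
keep = [col for col in cardinality.index if cardinality[col] < threshold]
print(keep)  # ['low_card']
```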

H1N1 Vaccine Data

In [ ]:
# h1n1 data, missings
h1n1_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 53415 entries, 0 to 26707
Data columns (total 29 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   age_group                    53415 non-null  object 
 1   education                    50601 non-null  object 
 2   race                         53415 non-null  object 
 3   sex                          53415 non-null  object 
 4   income_poverty               44495 non-null  object 
 5   marital_status               50565 non-null  object 
 6   rent_or_own                  49337 non-null  object 
 7   employment_status            50481 non-null  object 
 8   census_msa                   53415 non-null  object 
 9   h1n1_concern                 53238 non-null  float64
 10  h1n1_knowledge               53177 non-null  float64
 11  doctor_recc_h1n1             49095 non-null  float64
 12  opinion_h1n1_vacc_effective  52626 non-null  float64
 13  opinion_h1n1_risk            52647 non-null  float64
 14  opinion_h1n1_sick_from_vacc  52645 non-null  float64
 15  respondent_id                53415 non-null  int64  
 16  behavioral_antiviral_meds    53265 non-null  float64
 17  behavioral_avoidance         52994 non-null  float64
 18  behavioral_face_mask         53377 non-null  float64
 19  behavioral_wash_hands        53333 non-null  float64
 20  behavioral_large_gatherings  53256 non-null  float64
 21  behavioral_outside_home      53251 non-null  float64
 22  behavioral_touch_face        53159 non-null  float64
 23  chronic_med_condition        51512 non-null  float64
 24  child_under_6_months         51782 non-null  float64
 25  health_worker                51822 non-null  float64
 26  health_insurance             28913 non-null  float64
 27  household_adults             52941 non-null  float64
 28  household_children           52941 non-null  float64
dtypes: float64(19), int64(1), object(9)
memory usage: 12.2+ MB

Seasonal Vaccine Data

In [ ]:
# seas data, missings
seas_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 53415 entries, 0 to 26707
Data columns (total 27 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   age_group                    53415 non-null  object 
 1   education                    50601 non-null  object 
 2   race                         53415 non-null  object 
 3   sex                          53415 non-null  object 
 4   income_poverty               44495 non-null  object 
 5   marital_status               50565 non-null  object 
 6   rent_or_own                  49337 non-null  object 
 7   employment_status            50481 non-null  object 
 8   census_msa                   53415 non-null  object 
 9   doctor_recc_seasonal         49095 non-null  float64
 10  opinion_seas_vacc_effective  52501 non-null  float64
 11  opinion_seas_risk            52402 non-null  float64
 12  opinion_seas_sick_from_vacc  52357 non-null  float64
 13  respondent_id                53415 non-null  int64  
 14  behavioral_antiviral_meds    53265 non-null  float64
 15  behavioral_avoidance         52994 non-null  float64
 16  behavioral_face_mask         53377 non-null  float64
 17  behavioral_wash_hands        53333 non-null  float64
 18  behavioral_large_gatherings  53256 non-null  float64
 19  behavioral_outside_home      53251 non-null  float64
 20  behavioral_touch_face        53159 non-null  float64
 21  chronic_med_condition        51512 non-null  float64
 22  child_under_6_months         51782 non-null  float64
 23  health_worker                51822 non-null  float64
 24  health_insurance             28913 non-null  float64
 25  household_adults             52941 non-null  float64
 26  household_children           52941 non-null  float64
dtypes: float64(17), int64(1), object(9)
memory usage: 11.4+ MB

Filling missing data

Now we have to deal with missing data. Among the categorical columns, income_poverty has the most missing values: around 8,900, roughly 17% of the rows (per the info() output above). Dropping every row with a missing income_poverty value would discard too much data, so we drop the column instead. For the remaining columns, we simply fill missing values with the most common value (the mode).

In [ ]:
# Dropping income_poverty
h1n1_data.drop('income_poverty', axis=1, inplace=True)
seas_data.drop('income_poverty', axis=1, inplace=True)
In [ ]:
# Filling each column with the most common value
for col in h1n1_data.columns:
    h1n1_data[col] = h1n1_data[col].fillna(h1n1_data[col].mode()[0])
for col in seas_data.columns:
    seas_data[col] = seas_data[col].fillna(seas_data[col].mode()[0])
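Mode imputation works for both numeric and object columns, which is why a single loop covers everything; a minimal sketch on toy data (hypothetical values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'opinion': [1.0, np.nan, 1.0, 3.0],
    'status': ['Employed', None, 'Employed', 'Rent'],
})

# mode()[0] is the most frequent value in the column (works for both dtypes)
for col in df.columns:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df['opinion'].tolist())  # [1.0, 1.0, 1.0, 3.0]
```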

Preprocessing and preparing the data

While the train and test features are still together, it's a good idea to encode the categorical variables; doing it before splitting the data back apart ensures both parts share the same encoding.

In [ ]:
# Label Encoding
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

for column in h1n1_data.columns:
    if h1n1_data[column].dtype == 'object':
        h1n1_data[column] = le.fit_transform(h1n1_data[column])

for column in seas_data.columns:
    if seas_data[column].dtype == 'object':
        seas_data[column] = le.fit_transform(seas_data[column])
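One caveat: the loop above reuses a single LabelEncoder instance, so only the mapping fitted on the last column survives. That is fine here because we never invert the encoding, but if you later need `inverse_transform`, keep one fitted encoder per column; a hedged sketch on toy data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'sex': ['Male', 'Female', 'Male'],
                   'race': ['White', 'Black', 'White']})

encoders = {}
for col in df.columns:
    le = LabelEncoder()                 # one fitted encoder per column
    df[col] = le.fit_transform(df[col])
    encoders[col] = le                  # keep it for inverse_transform later

# Classes are sorted alphabetically: Female -> 0, Male -> 1
print(df['sex'].tolist())  # [1, 0, 1]
```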
In [ ]:
# Splitting test and training data
h1n1_train_data = h1n1_data[:len(train_features)]
h1n1_test_data = h1n1_data[len(train_features):]

seas_train_data = seas_data[:len(train_features)]
seas_test_data = seas_data[len(train_features):]
In [ ]:
# Target and features
targets = pd.read_csv('training_set_labels.csv')
h1n1_target = targets[['respondent_id', 'h1n1_vaccine']]
seas_target = targets[['respondent_id', 'seasonal_vaccine']]

h1n1_full = h1n1_train_data.merge(h1n1_target, on='respondent_id')
seas_full = seas_train_data.merge(seas_target, on='respondent_id')
In [ ]:
# Train Test Split
from sklearn.model_selection import train_test_split

# Features
X_h1n1 = h1n1_full.drop(['respondent_id', 'h1n1_vaccine'], axis=1)
X_seas = seas_full.drop(['respondent_id', 'seasonal_vaccine'], axis=1)
# Targets
y_h1n1 = h1n1_full['h1n1_vaccine']
y_seas = seas_full['seasonal_vaccine']

# Splitting the data into training and validation data
X_train_h1n1, X_test_h1n1, y_train_h1n1, y_test_h1n1 = train_test_split(
    X_h1n1, y_h1n1, test_size=0.25, random_state=14)
X_train_seas, X_test_seas, y_train_seas, y_test_seas = train_test_split(
    X_seas, y_seas, test_size=0.25, random_state=14)
In [ ]:
# Scaling the data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# We scale the test (submission) data too, not only the validation data
X_train_h1n1_scaled = scaler.fit_transform(X_train_h1n1) # train data
X_test_h1n1_scaled = scaler.transform(X_test_h1n1) # validation data
h1n1_test_data = h1n1_test_data.drop('respondent_id', axis=1)
h1n1_test_data_scaled = scaler.transform(h1n1_test_data) # test data

X_train_seas_scaled = scaler.fit_transform(X_train_seas) # train data
X_test_seas_scaled = scaler.transform(X_test_seas) # validation data
seas_test_data = seas_test_data.drop('respondent_id', axis=1)
seas_test_data_scaled = scaler.transform(seas_test_data) # test data
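The pattern above (fit_transform on the training split, transform everywhere else) is what keeps validation and test statistics out of the fitted mean and variance; a minimal sketch on toy numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [2.0], [4.0]])  # training mean is 2.0
X_new = np.array([[2.0]])                  # validation/test-style data

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from train only
X_new_scaled = scaler.transform(X_new)          # the same statistics reused

print(X_new_scaled)  # [[0.]] -- 2.0 equals the training mean
```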

Training models

We want to train two kinds of models: a gradient-boosting model and a neural network.

In [ ]:
from sklearn import metrics

# In this dictionary we will store the scores of each model
performance = {}

# Function to plot the ROC AUC Curve of each model
def roc_auc(model, vaccine=None):
    '''Train a model, make predictions on the validation
    data, print the ROC AUC score, and plot the ROC curve.

    Args:
        - model: the model used to make predictions
        - vaccine: 'h1n1' or 'seasonal'
    '''
    if vaccine == 'h1n1':
        X_train = X_train_h1n1_scaled
        X_test = X_test_h1n1_scaled
        y_train = y_train_h1n1
        y_test = y_test_h1n1
    else:
        X_train = X_train_seas_scaled
        X_test = X_test_seas_scaled
        y_train = y_train_seas
        y_test = y_test_seas

    # Training the model
    model.fit(X_train, y_train)
    # Predictions over validation data
    preds = model.predict_proba(X_test)
    # Plot ROC Curve
    fpr, tpr, thresholds = metrics.roc_curve(y_test, preds[:, 1])
    roc_auc = metrics.auc(fpr, tpr)
    display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc,
                                    estimator_name='example estimator')
    display.plot()
    plt.show()
    # ROC AUC Score
    print(f'ROC AUC Score {model.__class__.__name__} :', metrics.roc_auc_score(y_test, preds[:, 1]))
    # Appending the score to the dictionary
    if vaccine == 'h1n1':
        performance[model.__class__.__name__ + '_H1N1'] = metrics.roc_auc_score(y_test, preds[:, 1])
    else:
        performance[model.__class__.__name__ + '_Seasonal'] = metrics.roc_auc_score(y_test, preds[:, 1])

Model 1. Boosting

In [ ]:
# Boosting - H1N1
from xgboost import XGBClassifier

model_1_h1n1 = XGBClassifier()

# ROC Curve and ROC AUC
roc_auc(model_1_h1n1, 'h1n1')
C:\Program Files\Python37\lib\site-packages\xgboost\sklearn.py:1224: UserWarning:

The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].

[17:09:56] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
ROC AUC Score XGBClassifier : 0.8194323527370652
In [ ]:
# Boosting - Seasonal

model_1_seas = XGBClassifier()

# ROC Curve and ROC AUC
roc_auc(model_1_seas, 'seasonal')
[17:09:57] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
C:\Program Files\Python37\lib\site-packages\xgboost\sklearn.py:1224: UserWarning:

The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].

ROC AUC Score XGBClassifier : 0.8409562131009277

Model 2. Neural Network

In [ ]:
# NN - H1N1
input_shape = [X_train_h1n1_scaled.shape[1]]

from tensorflow import keras
from tensorflow.keras import layers

# Neural Network
nn1 = keras.Sequential([
    layers.BatchNormalization(input_shape=input_shape),
    layers.Dense(64, activation='relu'), 
    layers.BatchNormalization(),
    layers.Dropout(0.2), 
    layers.Dense(64, activation='relu'), 
    layers.BatchNormalization(),
    layers.Dropout(0.2),
    layers.Dense(64, activation='relu'), 
    layers.BatchNormalization(),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid')
])

# Optimizer, Loss Function and Metric
nn1.compile(
    optimizer='adam', 
    loss='binary_crossentropy', 
    metrics=[keras.metrics.AUC()]
)

# Early Stopping
early_stopping = keras.callbacks.EarlyStopping(
    patience=5,
    min_delta=0.001,
    restore_best_weights=True,
)

# Training
history = nn1.fit(
    X_train_h1n1_scaled, y_train_h1n1,
    validation_data=(X_test_h1n1_scaled, y_test_h1n1),
    batch_size=64,
    epochs=200,
    callbacks=[early_stopping],
)

# Results
history_df = pd.DataFrame(history.history)
history_df.iloc[:, [0, 2]].plot(title="Cross-entropy")
history_df.iloc[:, [1, 3]].plot(title="AUC")

preds = nn1.predict(X_test_h1n1_scaled)

performance['NN_H1N1'] = metrics.roc_auc_score(y_test_h1n1, preds.ravel())

# Plot ROC Curve
fpr, tpr, thresholds = metrics.roc_curve(y_test_h1n1, preds)
roc_auc = metrics.auc(fpr, tpr)
display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc,
                                  estimator_name='example estimator')
display.plot()

plt.show()
Epoch 1/200
313/313 [==============================] - 6s 14ms/step - loss: 0.5188 - auc: 0.7154 - val_loss: 0.4025 - val_auc: 0.8150
Epoch 2/200
313/313 [==============================] - 4s 13ms/step - loss: 0.4257 - auc: 0.7864 - val_loss: 0.3860 - val_auc: 0.8260
Epoch 3/200
313/313 [==============================] - 4s 12ms/step - loss: 0.4101 - auc: 0.8047 - val_loss: 0.3828 - val_auc: 0.8292
Epoch 4/200
313/313 [==============================] - 4s 13ms/step - loss: 0.4011 - auc: 0.8141 - val_loss: 0.3816 - val_auc: 0.8307
Epoch 5/200
313/313 [==============================] - 4s 13ms/step - loss: 0.3980 - auc: 0.8177 - val_loss: 0.3796 - val_auc: 0.8317
Epoch 6/200
313/313 [==============================] - 4s 13ms/step - loss: 0.3977 - auc: 0.8188 - val_loss: 0.3799 - val_auc: 0.8311
Epoch 7/200
313/313 [==============================] - 4s 13ms/step - loss: 0.3954 - auc: 0.8203 - val_loss: 0.3805 - val_auc: 0.8305
Epoch 8/200
313/313 [==============================] - 4s 13ms/step - loss: 0.3925 - auc: 0.8237 - val_loss: 0.3802 - val_auc: 0.8297
Epoch 9/200
313/313 [==============================] - 4s 13ms/step - loss: 0.3916 - auc: 0.8249 - val_loss: 0.3796 - val_auc: 0.8312
Epoch 10/200
313/313 [==============================] - 4s 13ms/step - loss: 0.3905 - auc: 0.8259 - val_loss: 0.3810 - val_auc: 0.8293
In [ ]:
# NN - Seasonal
input_shape = [X_train_seas_scaled.shape[1]]

# Neural Network
nn2 = keras.Sequential([
    layers.BatchNormalization(input_shape=input_shape),
    layers.Dense(64, activation='relu'), 
    layers.BatchNormalization(),
    layers.Dropout(0.2), 
    layers.Dense(64, activation='relu'), 
    layers.BatchNormalization(),
    layers.Dropout(0.2),
    layers.Dense(64, activation='relu'), 
    layers.BatchNormalization(),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid')
])

# Optimizer, Loss Function and Metric
nn2.compile(
    optimizer='adam', 
    loss='binary_crossentropy', 
    metrics=[keras.metrics.AUC()]
)

# Early Stopping
early_stopping = keras.callbacks.EarlyStopping(
    patience=5,
    min_delta=0.001,
    restore_best_weights=True,
)

# Training
history = nn2.fit(
    X_train_seas_scaled, y_train_seas,
    validation_data=(X_test_seas_scaled, y_test_seas),
    batch_size=64,
    epochs=100,
    callbacks=[early_stopping],
)

# Results
history_df = pd.DataFrame(history.history)
history_df.iloc[:, [0, 2]].plot(title="Cross-entropy")
history_df.iloc[:, [1, 3]].plot(title="AUC")

preds = nn2.predict(X_test_seas_scaled)

performance['NN_Seasonal'] = metrics.roc_auc_score(y_test_seas, preds.ravel())

# Plot ROC Curve
fpr, tpr, thresholds = metrics.roc_curve(y_test_seas, preds)
roc_auc = metrics.auc(fpr, tpr)
display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc,
                                  estimator_name='example estimator')
display.plot()

plt.show()
Epoch 1/100
313/313 [==============================] - 5s 12ms/step - loss: 0.5956 - auc_1: 0.7709 - val_loss: 0.4898 - val_auc_1: 0.8445
Epoch 2/100
313/313 [==============================] - 4s 13ms/step - loss: 0.5226 - auc_1: 0.8213 - val_loss: 0.4798 - val_auc_1: 0.8510
Epoch 3/100
313/313 [==============================] - 4s 12ms/step - loss: 0.5100 - auc_1: 0.8293 - val_loss: 0.4777 - val_auc_1: 0.8523
Epoch 4/100
313/313 [==============================] - 4s 13ms/step - loss: 0.5016 - auc_1: 0.8354 - val_loss: 0.4766 - val_auc_1: 0.8531
Epoch 5/100
313/313 [==============================] - 4s 12ms/step - loss: 0.4974 - auc_1: 0.8385 - val_loss: 0.4783 - val_auc_1: 0.8527
Epoch 6/100
313/313 [==============================] - 4s 13ms/step - loss: 0.4970 - auc_1: 0.8384 - val_loss: 0.4755 - val_auc_1: 0.8538
Epoch 7/100
313/313 [==============================] - 4s 12ms/step - loss: 0.4928 - auc_1: 0.8413 - val_loss: 0.4756 - val_auc_1: 0.8542
Epoch 8/100
313/313 [==============================] - 4s 12ms/step - loss: 0.4942 - auc_1: 0.8407 - val_loss: 0.4753 - val_auc_1: 0.8540
Epoch 9/100
313/313 [==============================] - 4s 13ms/step - loss: 0.4927 - auc_1: 0.8414 - val_loss: 0.4753 - val_auc_1: 0.8540
Epoch 10/100
313/313 [==============================] - 4s 12ms/step - loss: 0.4882 - auc_1: 0.8449 - val_loss: 0.4756 - val_auc_1: 0.8538
Epoch 11/100
313/313 [==============================] - 4s 13ms/step - loss: 0.4903 - auc_1: 0.8433 - val_loss: 0.4751 - val_auc_1: 0.8542

Results

Let's check our results:

In [ ]:
# Results DataFrame
scores_df = pd.DataFrame(performance, index=['ROC AUC Score']).T.sort_values('ROC AUC Score', ascending=False)
scores_df['Vaccine'] = ['Seasonal', 'Seasonal', 'H1N1', 'H1N1']
scores_df['model'] = scores_df.index
scores_df = scores_df.reset_index(drop=True)
scores_df
Out[ ]:
ROC AUC Score Vaccine model
0 0.853762 Seasonal NN_Seasonal
1 0.840956 Seasonal XGBClassifier_Seasonal
2 0.831647 H1N1 NN_H1N1
3 0.819432 H1N1 XGBClassifier_H1N1
In [ ]:
fig = px.bar(scores_df, x='model', y=scores_df.columns[:1], color='Vaccine', 
            title='Obtained Scores', labels={'value': 'ROC AUC Score'})
fig.update_layout(height=400, width=900, template='plotly_white',
                  autosize=False, showlegend=False, yaxis_range=[0.7, 0.85])
fig.show()

The best model was the neural network in both cases.

Final Predictions and Submission to Competition

In [ ]:
# Predictions

# 1. H1N1 Vaccine
h1n1_preds = nn1.predict(h1n1_test_data_scaled)
# 2. Seasonal Vaccine
seas_preds = nn2.predict(seas_test_data_scaled)
In [ ]:
pd.Series(h1n1_preds.ravel())
Out[ ]:
0        0.058584
1        0.073270
2        0.503143
3        0.658012
4        0.174645
           ...   
26703    0.399162
26704    0.096368
26705    0.090683
26706    0.041032
26707    0.364036
Length: 26708, dtype: float32
In [ ]:
# Labels - respondent_id
resp_ids = test_features['respondent_id']
h1n1_vaccine = pd.Series(h1n1_preds.ravel())
seas_vaccine = pd.Series(seas_preds.ravel())
In [ ]:
# Submission DataFrame
submission = pd.concat([resp_ids, h1n1_vaccine, seas_vaccine], axis=1)
submission = submission.rename(columns={
    0: 'h1n1_vaccine', 1: 'seasonal_vaccine'})
submission
Out[ ]:
respondent_id h1n1_vaccine seasonal_vaccine
0 26707 0.058584 0.221534
1 26708 0.073270 0.027013
2 26709 0.503143 0.684195
3 26710 0.658012 0.890796
4 26711 0.174645 0.436948
... ... ... ...
26703 53410 0.399162 0.585251
26704 53411 0.096368 0.213560
26705 53412 0.090683 0.158435
26706 53413 0.041032 0.435021
26707 53414 0.364036 0.727593

26708 rows × 3 columns

In [ ]:
# Exporting the Data as CSV to make the submission
submission.to_csv('submission.csv', index=False)
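Before uploading, it is worth checking the exported file against the submission sample mentioned at the start: same columns, same number of rows, same respondent_id order. A hedged sketch with toy stand-in frames (in practice you would `pd.read_csv` the real sample and `submission.csv`):

```python
import pandas as pd

# Toy stand-ins; real values would come from the downloaded sample file
sample = pd.DataFrame({'respondent_id': [26707, 26708],
                       'h1n1_vaccine': [0.5, 0.5],
                       'seasonal_vaccine': [0.7, 0.7]})
submission = pd.DataFrame({'respondent_id': [26707, 26708],
                           'h1n1_vaccine': [0.06, 0.07],
                           'seasonal_vaccine': [0.22, 0.03]})

def check_submission(sub, sample):
    """Column names, row count, and respondent_id order must match the sample."""
    return (list(sub.columns) == list(sample.columns)
            and len(sub) == len(sample)
            and sub['respondent_id'].equals(sample['respondent_id']))

print(check_submission(submission, sample))  # True
```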

Score in the Competition

We reached a score of 0.83, which places us in the top 25% worldwide. The best global scores are around 0.86 ROC AUC.